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ABSTRACT 

In this work we apply the Internal Standard-based 
analytical approach that we described in an earlier 
communication and here we demonstrate experi- 
mental results on functional associations among 
the hypervariably-expressed genes (HVE-genes). 
Our working assumption was that those genetic 
components, which initiate the disease, involve 
HVE-genes for which the level of expression is un- 
distinguishable among healthy individuals and indi- 
viduals with pathology. We show that analysis of the 
functional associations of the HVE-genes is indeed 
suitable to revealing disease-specific differences. 
We show also that another possible exploit of 
HVE-genes for characterization of pathological 
alterations is by using multivariate classification 
methods. This in turn offers important clues on nat- 
urally occurring dynamic processes in the organism 
and is further used for dynamic discrimination of 
groups of compared samples. We conclude that 
our approach can uncover principally new collective 
differences that cannot be discerned by individual 
gene analysis. 

INTRODUCTION 

The microarray technology has revolutionized the study of 
biology by allowing for simultaneous examination of 
thousands of genes — the total genome expression profile. 



However, the most exciting prospect is to characterize the 
organism as a whole by defining the functional associ- 
ations among their genes. It turns out that it is not 
possible to visualize genetic associations in a steady 
state. To understand the dynamic features of interest, 
the underlying system must be stimulated to elucidate 
the features of the biological regulatory networks. 
A common practice in experimental biology has been to 
make single, stepwise changes in one variable at a time 
and to follow the system's response as it proceeds from 
an initial steady state to a final steady state. 

Although such changes lead to results that are interpret- 
able from a biochemical point of view, step changes do not 
persistently excite the network since most of the data will 
be biased because of approaching the new steady state. As 
a result, many dynamic features remain unidentified, even 
with extensive prior knowledge. Capturing the multivari- 
ate nature of biological regulatory networks requires the 
introduction of multivariate random perturbations, espe- 
cially when the underlying data contain high levels of 
noise. As it was shown earlier (1), random, independent 
inputs enable better identification of relevant results, and 
such identification is more robust to noise. 

In most biological systems, random stimulations from 
the environment continue throughout the life span of the 
organism, and the organism persistently reacts in turn to 
such random stimulations. Genes participating in this 
reaction are in dynamic states. Thus, it is possible to 
reveal genes displaying an extraordinarily high variability 
of expression, and we call these genes 'hypervariably 
expressed genes' or HVE-genes. It has been shown that 
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even in genetically identical individuals; tissues display a 
considerable degree of variation in gene expression (2). 
There are multiple reasons for the extreme variability of 
such genes. For example, previously unrecognized 
heterogeneities could be present in the presumably homo- 
genous group of samples, or there may be genes that are 
involved throughout different phases of internal dynamic 
processes. 

Genetic diseases are often associated with the manifest- 
ation of profound genetic variations. Hence, under such 
conditions increased variability of some genes will be 
expected, although the association of these genetic vari- 
ations with transcriptional changes cannot be directly 
inferred. Genes that demonstrate variability in expression 
at the population level could be potential candidates for 
further studies of the genetic architecture of complex traits 
associated with pathology, especially if these gens display 
intra-individual stability. In this context, it is interesting to 
note that gene expression variability is often increased in 
autoimmune pathologies and is normalized again after 
successful treatment [see e.g. (3-5)]. 

Examples of significant increases of the proportion of 
HVE-genes in various inflammatory pathologies include 
lupus, rheumatoid arthritis and TNF Receptor 
Associated Periodic Syndrome (TRAPS). Because 
TRAPS is a rare autoinflammatory disorder caused by 
mutations in the extracellular domain of the TNF 
receptor superfamily 1A, one does expect to observe dif- 
ferences in gene expression variability when comparing 
TRAPS patients with healthy donors. Indeed when 
comparing 14 TRAPS patients with a counterpart of 14 
healthy donors, 124 genes displayed increased expression 
variability in the samples from TRAPS patients 
(Figure 2A). Many of these genes are members of the 
TNF receptor pathway and are associated with inflamma- 
tory processes (as shown by the Ingenuity Pathway 
Analysis presented in Supplementary Figure SI). It is of 
interest that among the outlined entities, Mediterranean 
fever gene (MEFV) is present — a hallmark of another 
close to TRAPS pathology — Mediterranean fever (6). 

The most prominent problem in studying HVE-genes is 
the lack of statistical methods to facilitate the selection of 
HVE-genes from microarray experiments in which sample 
sizes are too small to use standard statistical techniques. 
Variable gene expression can be a characteristic feature of 
pathology, but the lack of adequate methods for multi- 
variate analysis complicates the interpretation of the 
obtained results, especially regarding the reproducibility 
and reliability of the established features (7,8). The 
reasons behind these objections include the instability of 
existing methods and sample sizes that are too small to 
support the notion of reliable variability features. 

We demonstrated earlier (9), that many problems of 
genome-scale microarray experiments, which appeared to 
be consequences of the vast amount of information, were 
successfully resolved by the use of the Internal Standard 
strategy. In this method information about nonspecific 
variations is dissociated from the conventional behavior 
of genes that share certain features, such as equity in ex- 
pression, stability and distinctiveness from background 
noise. Knowledge of the parameters governed by 



Internal Standards is an added benefit to statistically 
robust analyses of functional associations by clustering 
and networking genes. 

In this communication, we present the application of 
the Internal Standard strategy to HVE-gene selection 
and a functional analysis based on strong statistical 
criteria. Rather than presenting an orderly, methodologic- 
al approach, we assembled data obtained throughout 
several research endeavors, and we present the actual 
results from applying multivariate procedures to the 
analysis of HVE-genes in both normal and pathological 
processes. 

Programs created for the selection and analyses of the 
features of the HVE-genes are implemented in MatLab 
(Mathworks, MA, USA) and available from authors 
upon request. 



MATERIALS AND METHODS 

Gene expression data sets 

This work uses a wide spectrum of experimental data. The 
actual biological portion of the experiments was per- 
formed in a collaborative manner separately for each 
sub-project, and portions of them have already been 
reported in independent publications or are in preparation 
for publications. The common denominator of each of 
these projects is the evaluation procedure. Expression 
data sets were obtained using various sources of mRNA 
and several microarray technologies. Fragmented descrip- 
tions of the experimental protocols and the microarray 
experiments are given in Table 1 and in the 
Supplementary Data. The reason for compiling multiple 
diverse biological experiments into a single paper is to 
allow the output microarray data from these experiments 
to be analyzed using the Internal Standard-based analysis 
procedure. 

Microarray data analysis 

The methods used for gene expression analysis are based 
on the use of Internal Standards, which are constructed by 
identifying a large family of similarly behaving genes. The 
application of these Internal Standards to the normaliza- 
tion of microarray data and the differential analysis of 
gene expression was presented in the first part of this 
project (9). 

The normalization procedure consists of two subse- 
quent steps: 

• The first step is the determination of the parameters of 
the background of the array — the average (Av) and 
standard deviation (SD) of normally distributed low 
level expressions in an array with subsequent normal- 
ization of all expressions in the array. A normalized 
score, 'S,' is obtained [S = (PV-Av)/SD], where PV is 
the original pixel value for the spot, and Av and SD 
are the mean and standard deviation respectively, of 
the set of background spots. The distribution of S has 
zero mean and SD = 1 over the set of background 
genes in the normalized array. Only genes expressed 
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above background (>3 SDs) are used for the second 
step 'adjustment'. 

• The second step is the adjustment of the normalized 
profiles to each other by robust regression analysis of 
genes expressed above background. This procedure is 
based on the selection of equally expressed genes as a 
homogenous family of genes, with normally 
distributed residuals defined as deviations from the re- 
gression line. Outliers are thereafter determined as 
genes having deviations not associated with this 
internal standard of equity in expression, which 
include thousands of members. 

• For multi-sample data adjustment an averaged profile 
is calculated and each sample is adjusted to the 
averaged profile using the robust regression procedure 
described above. A new averaged profile is calculated 
from transformed profiles of the samples and the 
adjustment procedure is repeated. Several subsequent 
adjustment may be necessary for the best result, 
however for the data initially normalized to back- 
ground two steps of adjustment are usually sufficient. 

One of the most important criteria in the selection of 
HVE-genes and the analysis of their behavior is the choice 
of the 'Reference Group' — which is an Internal Standard 
for equity in expression and for stability of the analyzed 
processes (absence of variability exceeding technological 
and biological noise). 

Procedure for establishing the 'Reference Group' 

The Reference Group is constructed by identifying a set of 
genes expressed above background level with inherently 
low variability as determined by an F-test. The procedure 
consists of two steps; the first step ensures that an absolute 
majority of stable genes are identified, while the second 
step ensures that the outliers are excluded with a simple 
iterative procedure. At the beginning, all genes are repre- 
sented by their residuals (relatively averaged profile), 
which after normalization and log transformation lose 
their sample-dependent individuality as well as their ex- 
pression level-dependent individuality (Figure 1A). For 
the majority of genes, the variation between replicates is 
relatively small and homogenous and follows the standard 
^-distribution. A small portion of genes that exhibit high 
variation (statistically distinct from the rest) are the 
HVE-genes. To obtain the Internal Standard for gene 
variability, HVE-genes should be excluded by an iterative 
procedure (9). The F-test is used as the criterion for the 
exclusion of outliers, i.e. genes that exhibit an estimated 
variability that is considerably higher than that that of the 
total group. The total group variability is recalculated 
after each exclusion step, and the procedure is repeated 
until no additional genes can be excluded by this proced- 
ure. The statistical threshold for the exclusion of 
HVE-genes is chosen such that these exclusions are 
based on an exceptional F-value (usually F<0.05). The 
completion of all the exclusion process a new Internal 
Standard called the 'Reference Group', which is 
composed of genes expressed above the background of 
control samples with a low variability of expression (as 
determined by an F-test) and whose residuals approximate 



a normal distribution. Though not all excluded genes are 
HVE-genes, we can be sure that the majority of them are 
excluded and will not interfere with the estimation of par- 
ameters for the rest of the analysis. The Reference Group 
is further used for selection of HVE-genes and for analysis 
of their functional associations in clustering and network- 
ing procedures. 

List of four resumes of calculations steps 

Upon providing in the 'Result' section detailed explan- 
ations and arguments about the chosen path of calcula- 
tions, procedures summarizing the calculation steps are 
presented in four sequential step-by-step resumes. 

Step-by-step Resume 1: Associative analysis of differ- 
ences in gene expression variations. 

Step-by-step Resume 2: F-means cluster analysis of 
HVE-genes co-expression. 

Step-by-step Resume 3: Correlation mosaic analysis of 
HVE-genes co-expression. 

Step-by-step Resume 4: Networking procedure based on 
the use of partial correlations. 



RESULTS 

All of the experiments described in this communication 
were analyzed using the Internal Standard approach, 
which has been described in our earlier paper (9), in com- 
bination with other methods. 

Selection of 'hypervariably expressed genes' 

Upon establishing the Internal Standard of biological sta- 
bility (Figure 1A) the selection of HVE-genes was made 
using strict statistical criteria. HVE genes were identified 
as those for which the expression level varied significantly 
(P < Po) when comparing the variability of individual 
genes to the variability of the 'Reference Group'. The 
threshold Po was chosen either in a restricted manner 
(Po < 1 IN, where N is the number of all genes expressed 
significantly differently from background noise) or in a 
moderate manner (Po < 0.05), depending on the purpose 
of the subsequent analysis. Choosing the threshold as 
Po < 1 /N (N was often more than half of all genes on 
the array) can be considered to be a slight modification 
of the Bonferroni correction for multiple hypothesis tests. 
Such a choice excludes virtually all false positives, but 
consequently loses many true positives as well. This 
choice should be made when selecting HVE-genes that 
are unique to any given group. In situations in which 
the traditional P = 0.05 is applied, many false positives 
will be retained. Nevertheless, this choice can be useful 
when studying HVE-genes that reproducibly appear in 
several groups, cluster together or reproducibly intercon- 
nect in a subsequent networking procedure. All of these 
subsequent steps refine the list of HVE genes to only 
those that demonstrate some reproducible features that 
are probabilistically less likely to be present in false 
selections. 

Hyper-variations appearing from experimental errors 
(the influence of dirty spots) were statistically filtered 
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Figure 1. F-means clustering procedure. (A) The standard deviations of genes from the Reference Group, with HVE-genes (red bars) included. 
(B) Gene content of the cluster with seeding profile shown as a red line. (C) Deviations of genes' profiles from the seeding profile (shown as red SD 
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(loglO presentation), (B) The sample numbers. Ordinate: (A) and (B) Gene expression deviations from the equity of expression; (B) Gene expression 
levels in samples normalized to have zero mean (over all samples) and SD = 1. 



from this analysis by comparing the variability of the re- 
siduals in a replicated group of samples with the same 
variability obtained by excluding both the maximum and 
minimum one at a time. A statistically significant decrease 



in variability after excluding one replicate provides 
evidence of a possible error in that particular replicate. 
Such genes are excluded from the family of HVE-genes 
as being falsely selected. 
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Increased gene expression variability associated with 
pathologies 

In replicated microarray experiments, each gene in the 
array can be characterized by two independent param- 
eters: the level of expression and the variability (except 
in regions of low-intensity spots that are abundantly 
contaminated with highly variable background noise). In 
addition to the conventional comparison of gene expres- 
sion levels, it is possible to compare their variability using 
strict statistical criteria. The conventional statistical 
method for comparison of variability, ANOVA, encoun- 
ters the same obstacles when applied to the analysis of 
microarray experiments containing immense amount of 
information. The conventional low statistical threshold 
(P < 0.05) will produce a large output of false positive se- 
lections, whereas any profound adjustments of this thresh- 
old will result in the loss of sensitivity of the statistical test. 
The practice of using the Internal Standard resolves this 
problem with the same efficiency as was achieved for dif- 
ferential gene expression analysis (9). 

Selecting genes with different variabilities relies on the 
next statistical steps. First, the F-test was used to identify 
HVE-genes in each group of samples. Next, the differences 
in their variability were determined in a paired 
comparison. 

Resume 1: Differential analysis of gene expression 
variability. Two groups are considered: Group 1 has 
n chips and k genes, while Group 2 has m chips and k 
genes. 

Data is first normalized as described in the 'Materials 
and Methods' section and presented in log-transformed 
form, making the variability of the majority of genes in- 
dependent of the level of their expression. 

• Reference groups are created for each group of 
samples (Groups 1 and 2) and HVE-genes are selected 
in each group as previously described. (Associative 
F-tests, with m + k-2 degrees of freedom (a = r), to 
establish if the gene associates/belongs to the group 
of stably expressed genes). 

• A paired F-test is performed on the genes selected as 
HVE-genes in both groups (Groups 1 and 2, compari- 
son of the SDs for the same gene in two groups — with 
n + m-2 degrees of freedom and threshold corrected 
for the multiple hypothesis tests), to determine whether 
the genes have equal SDs. 

• Additional restrictions on the fold change and the 
minimal average level of expression may applied. The 
data are grouped into five sets: 

BO: HVE-genes without differences in variability in 

the case-control comparison 
Bl: HVE-genes having significantly higher variation 

in the Experimental group 
B2: HVE-genes having significantly higher variation 

in the Control group 
B3: Genes that exhibit the HVE property only in the 

Experimental group 
B4: Genes that exhibit the HVE property only in the 

Control group 



The ratio of SDs for HVE-genes in groups Bl and B2 
was used to exclude changes that are statistically signifi- 
cant but are not biologically significant. The fold change 
restriction was usually applied as an addition to the stat- 
istical analysis to draw attention to the most prominent 
differences. Upon excluding Bo, all other groups (Bl -B4) 
contain genes that exhibit some characteristic differences 
in the variability of expression level when comparing 'ex- 
perimental versus control'. These genes also establish a 
pathology-specific fingerprint. Unique variable genes 
from the B3 group are of special importance in addressing 
questions about dynamic processes associated with any 
given pathology. 

To understand the mechanisms behind a disease, one 
should first attempt to establish whether disease-specific 
differences in gene variability are the consequence or the 
cause of the pathology. The superfluous variability of 
normally stable genes as well as the 'freezing' of genes 
predicted to participate in dynamically adaptive reactions 
could provide clues towards the understanding of the 
pathology. 

Increased variability can also be of a non-genetic, 
physiological nature; and one might expect that many 
pathologies, such as inflammation, that are associated 
with a burst of dynamic changes are also accompanied 
with a considerable increase in the portion of genes that 
display high variability. 

Examples of significant increases in the proportion of 
HVE-genes in various inflammatory pathologies include 
lupus, rheumatoid arthritis and TRAPS. Because 
TRAPS is a rare autoinflammatory disorder caused by 
mutations in the extracellular domain of TNF receptor 
superfamily 1A, differences in gene expression variability 
are expected when comparing TRAPS patients with 
healthy donors. Indeed, when comparing 14 TRAPS 
patients with a counterpart of 14 healthy donors, 124 
genes were found to display increased expression variabil- 
ity in the samples from TRAPS patients (Figure 2A). 
Many of these genes are members of the TNF receptor 
pathway and are associated with inflammatory processes 
(as shown by the Ingenuity Pathway Analysis presented in 
Supplementary Figure SI). It is of interest that 
Mediterranean fever gene (MEFV) is present among the 
outlined entities. This gene is associated with 
Mediterranean fever, a disease with similar pathology to 
TRAPS (6). 

Increased variability may be associated with the devel- 
opment of pathology. Figure 2B presents the appearance 
of uniquely variable genes in the course of the transform- 
ation of endometrial cells into cancer cells by the action 
of the carcinogen DMBA (7,12-dimethylbenz[a]anthra- 
cene) (10). 

Increased variability may also be observed in 
pathologies that are less dynamic than inflammatory con- 
ditions, for example, chronic pathologies that are not 
associated with a burst of dynamic changes. Figure 2C 
presents genes that demonstrate stable expression levels 
in B cells from normal healthy donors and extreme vari- 
ations in samples from patients with B cell chronic 
lymphocytic leukemia (non-mutated and mutated sub- 
groups) (11). 
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The seemingly chaotic behavior of gene expression vari- 
ation in various pathologies could in fact be a result of the 
superposition of several co-expressed groups of genes. An 
example of this phenomenon is presented in Figure 2D, 
where a group of variable genes in Juvenile Rheumatoid 
Arthritis (JRA) patients reveal closely related 
co-expression patterns. 

The set of genes that are uniquely expressed in any given 
pathology is referred to as the 'fingerprint' or 'signature' 
of the particular pathology (12). We extend this definition 
to refer to the set of uniquely variable genes and coin the 
expression 'functional fingerprint'. 

An interesting example of a 'functional fingerprint' in 
autoimmune pathologies was obtained using lupus prone 
mice. We compared mice with the Slel mutation, which 
makes them susceptible to the development of lupus-like 
pathology, with mice possessing an additional Slesl 
mutation that in turn cancels the effect of the first Slel 
mutation (13-15). We found that in B220 + cells, 35 genes 
that were stable in healthy animals, became variable in 



B6Slel mice and again reverted into stable form in 
B6Slel Slesl mice (Supplementary Figure S2). In CD4 + 
cells, changes in variabilities of 150 genes was associated 
with the Slel mutation. 

F-means clustering for inferring functional 
interconnections 

There are diseases in which differences in HVE-genes 
occur at particular stages of disease manifestation, while 
no distinctive differences are evident at the onset. The only 
means of revealing pathology-specific differences is 
through the analysis of functional associations for such 
HVE-genes. The most commonly used computational 
approach to analyzing such functional associations is 
cluster analysis. 

^-means cluster analysis of HVE-genes is an unsuper- 
vised method, in which every decision, including the selec- 
tion of variable genes, the search for the optimal number 
of clusters, as well as optimization of the distribution of 
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genes over clusters, is solved using statistical criteria. If we 
know the precise differences in the gene expression levels 
among the samples, we would have a 'true' clustering. The 
residuals from the Reference Group provide an empirical 
estimate of the error of the distribution, or the 'noise' in 
the data. 

F-means clustering of HVE-genes was initiated by 
defining a parameter called the connectivity, which is 
defined as the number of genes that vary in expression 
in a similar manner as the 'seed' gene. Clusters then 
were nucleated starting with genes of highest connectivity. 
Genes of lower connectivity were included in a given 
cluster if their expression levels deviated from the 
seeding profile without exceeding the variation of the re- 
siduals in the Reference Group based upon an F-test 
(Figure IB and C). The number of different clusters was 
determined by the experimental system's ability to distin- 
guish differences exceeding random fluctuations of the 
normalized residuals in the Reference Group. 

Resume 2: F-means cluster analysis of the coexpression of 
HVE-genes. The clustering procedure consists of the fol- 
lowing steps: 

• Gene expression normalization, log-transformation 
and rescaling as noted above. 

• Selection of HVE-genes. Exclusion some of them 
whose extreme variability was produced by the devi- 
ation from stable state in only one sample to minimize 
the influence of technical errors. 

Determination of the connectivity, for each of these 
HVE-genes. Connectivity is defined as the number of 
genes whose expression patterns does not vary from the 
expression pattern of a given gene within the ranges 
derived from the Reference Group (based on the F-test). 
The appropriate correction of threshold for the F-test 
should be used to diminish the proportion of false 
positive selections (Po < l/N, N- number of HVE-genes). 

HVE-genes for each group are sorted by their connect- 
ivity and the clustering process begins with the genes ex- 
hibiting the highest connectivity. The first cluster contains 
the gene with the highest connectivity and all genes whose 
deviations from the expression of this gene in each sample 
have variabilities that do not exceed the variability of the 
Reference Group. The next gene of higher connectivity 
not belonging to the first cluster acts as the starting 
point for Cluster #2, and other genes are included in this 
cluster using the same criteria as in the first cluster. This 
process continues until all genes are analyzed. Genes that 
appeared in more than one cluster are considered to be 
likely functional links among these clusters. Genes that 
have zero connectivity do not belong to any cluster. 
Additional restrictions on the choice of the thresholds 
for statistical tests and the minimal cluster content can 
be elicited from simulation experiments where the gene 
expression data are replaced with random data having 
the same characteristic parameters (average and 
standard deviation). The use of simulated data establishes 
the minimal cluster content that appears by chance at the 
chosen statistical thresholds. 



Three potentially different results are distinguished: 

• functional associations for genes from the B4 set are 
characteristic of dynamic processes that prevail under 
normal conditions and are absent in pathology; 

• functional associations appear under pathological con- 
ditions only for genes from the B3 set, are uniquely 
variable in the pathological group and are stable in the 
normal control group 

• functional associations for genes from the BO, Bl and 
B2 sets are significantly modulated in one of the 
compared groups (normal control or pathology). 

Hypervariably expressed genes demonstrate similar 
patterns of variations 

The co-expression of HVE-genes or similarities in their 
expression profiles are of particular importance to under- 
standing the biological significance of these findings. The 
idea that co-expression of genes revealed by the clustering 
procedure implies the participation of these genes in 
general biological processes was first formulated by the 
group of Eisen (16). An extension of this idea is that the 
same should be true for HVE-genes, whose different level 
of expression can be considered as snapshots of some dy- 
namical process. In contrast to temporal dynamics, the 
actual shape of the cluster in the case of HVE-genes is 
of lesser significance as shown in Figure 3. Even if 
HVE-gene expression in each sample is consistent with 
some phase of a dynamic process, the absence of informa- 
tion about the real sequence of events makes the shape of 
the profile useless. 

Several practical examples demonstrate the consistent 
characteristics of the variation in the expression levels of 
the group of clustered genes. The first example was 
obtained from analysis of gene expression in T lympho- 
cytes from a homogenous group of mice. Figure 4 dem- 
onstrates that dozens of genes with significantly high 
variations in their expression levels could be gathered in 
clusters. The very high content of these clusters excludes 
the possibility of chance variations. 

Another example of co-expression of HVE-genes was 
obtained through analysis of gene expressions in samples 
from TRAPS patients (Figure 5). The majority of genes in 
the biggest clusters in samples from two entirely unrelated 
groups — healthy controls and TRAPS patients — had iden- 
tical co-expression patterns. The largest clusters in the 
control group and in the group of TRAPS patients 
consist of 163 and 51 genes, respectively. We applied the 
same technique to F-means clustering in groups produced 
from controls and patients by substituting of real data 
with random values having the same averages and SD 
for each gene. The largest cluster obtained in this simula- 
tion procedure was 10 times smaller than the largest 
cluster in the actual control group, and no genes were 
found to cluster in the simulated patient group. Similar 
results were found when comparing the eight largest 
clusters obtained from the analysis of real and simulated 
data (Figure 6). 

Another example was created earlier in the course of 
gene expression analysis in samples of children with 
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sample # 

Figure 3. Shapes of the HVE gene expression profiles does not have 
sense. Diagrams illustrating the formation of the cluster profiles of 
HVE-genes in a homogeneous group. (A) Possible assortment of nine 
samples representing two dynamical processes with participation of 
several genes, each of whose profiles are shown in either red or 
black. (B) Variant of A in which the order of the samples is arbitrarily 
changed. The exact shape of the dynamical process is lost after such 
rearrangement, but the fact of gene co-expression is still evident. 



polyarticular JRA and normal healthy controls 
(27 samples altogether) (17). In this work the sizes of the 
HVE-gene clusters also significantly exceeded the sizes of 
clusters identified in the simulation experiment. 
Additional validation of the biological meaningfulness of 
partitioning HVE-genes into clusters was obtained by 
analyzing of the cluster contents. The two biggest 
clusters consisted exclusively of genes encoding ribosomal 
proteins, while others consisted of genes encoding general 
regulatory proteins, such as insulin and NF-kB, and also 
of protein involved in mitochondrial protein synthesis, 
proteasome and mini-chromosome maintenance DNA 
replication complex. Furthermore, many co-expressed 
genes shared a common function; for example genes 
encoding numerous glycolytic enzymes and genes 
involved in the tricarboxylic acid cycle. (17) 

We have reported many other examples of employing 
F-means clustering for the analysis of clinical and experi- 
mental data in a series of publications (17-20). 



Correlation mosaic analysis to visualize changes in 
cluster associations 

Both the reproducibility and significant differences in the 
clustering results are usually estimated visually, or quali- 
tatively. Here, we present correlation mosaic based visu- 
alization of global patterns in expression data with 
individually presented interconnections between patterns 
and genes. This approach can be used as an independent 
clustering procedure or as an addition to the completed 
F-means clustering results. In this example the clustering 
procedure is based on the Pearson correlation and consists 
essentially of the sequence of operations used in F-means 
clustering described above. The primary difference is that 
instead of using deviation variability as a measure of 
distance, we use a correlation coefficient. The number of 
clusters and the cluster contents are determined using a 
threshold that can be established in simulation experi- 
ments. The output of this procedure consists of three 
data sets: first, cluster allocation for all genes in the 
analysis, second, connectivity parameter for each gene, 
and third, matrices of correlation coefficients. Matrices 
of correlation coefficients can be represented in a graph- 
ical form known as a correlation mosaic, which is conveni- 
ent for the visual inspection of the differences in gene 
associations between cases and controls. 

Resume 3: Correlation mosaic analysis of the co-expression 
of HVE-genes. The procedure consists of the following 
steps: 

• Normalization of gene expression and identification of 
HVE-genes is conducted as in Resume 1. HVE-gene 
expression data are presented in normalized units. 

• A connectivity parameter is defined for each 
HVE-gene as the number of other genes whose expres- 
sion profiles correlate with any given gene above the 
threshold 'tr'. The appropriate choice of threshold is 
obtained in simulation experiments. 

• HVE-genes in each group are sorted by their connect- 
ivity, and the clustering process begins with genes of 
the highest connectivity. The gene with the highest 
connectivity and all genes that deviate from this 
gene's expression in each sample with variabilities 
not higher than the variability of the Reference 
Group comprise Cluster #1. The next gene not belong- 
ing to the first cluster and genes selected as not signifi- 
cantly deviating comprise Cluster #2. The process 
continues until all genes are analyzed. Genes that 
have zero connectivity do not belong to any cluster. 

• The result is presented as a color-plot with the gene 
numbers used as the coordinates along the axes, with 
the same ordering G\. . G n used along the abscissa and 
the ordinate). 

• When the correlated gene associations are compared 
between two groups of samples, the order of 
coordinated genes is the same in both mosaics. 

This correlation mosaic method was applied to the 
analysis of gene expression data and cytokine multiplex 
data in clinical and experimental samples (17-26). In the 
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Figure 4. f-means clustering of gene expressions in T cells from B6 mice. The six largest clusters are shown. Abscissa: cluster numbers derived from 
10 samples from 10 different mice. Ordinate: the normalized expression levels. Figures in brackets: the numbers of genes in each cluster. 



very first example a mouse model of bladder inflammation 
was used to investigate the role of neurokinin 1 receptors 
(NK1R) and neprilysin (NEP) in neurogenic inflamma- 
tion. Cystitis was induced in wild-type mice sensitized to 
human serum albumin after being challenged with the 
same antigen. Microarray analysis revealed that inflam- 
matory processes in wild mice-type led to a 
downregulation of neprilysin expression. The most prom- 
inent cluster of activator protein 1 (AP-l)-responsive 
genes included neprilysin (upper portion of Figure 7). In 
contrast, NKiR _/ mice failed to mount an inflammatory 
reaction and the presence of neprilysin negatively 
correlated with the expression of the same gene(s) in 
wild-type mice (bottom Figure 7). The switching of NEP 
correlations from positive in wild-type mice to negative in 
NK1R -,L mice is very convincing in this presentation. 
This work (21) provided a suitable model for elucidating 
the involvement of AP-1 transcription factor in bladder 



inflammation and suggested a testable hypothesis regard- 
ing the role of NK1R and NEP in inflammation. 

• The correlation mosaic analysis also was applied to 
HVE-genes in JRA data as given above. Figure 8 
presents an outstanding visualization of the changes 
in some gene associations with other cluster members 
during the course of treatment of JRA patients. 
Analysis of the healthy donor group (HD group) 
reveals the presence of two highly correlated clusters 
of genes. The color variation in the mosaic visualizes 
the differences among the healthy donors (HD), 
non-treated (AD) and treated partially-responding 
(PR) patients. On closer inspection, the involvement 
of genes with altered functional interconnections 
within each cluster indicates that those genes are 
directly involved in the pathology (17). 

• These examples demonstrate that with the use of 
color-coded correlation mosaics, complicated 
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Figure 5. Reproducibility of the HVE gene co-expression in two unrelated sample groups: NC (normal controls) and TP (TRAPS patients). 
Normalized expression levels (ordinate) are presented against the numbers of samples in each group. Genes in the largest cluster (#1, A) in the 
NC group are also co-expressed in the TP group (B). Most of the genes belong to the largest cluster (#2, D) in the TP group. Conversely, genes in the 
largest cluster (#2, D) of the TP group are co-expressed in the NC group (C) and again almost entirely belong to the largest cluster of the NC group. 
The second largest cluster of the NC group #1 (E) is the inversion of the #1 cluster (a) in the NC. Genes are almost entirely in the second largest 
cluster (#1, F-H) of the TP group. The opposite is seen in (G and H). In contrast with the NC, Clusters #1 and #2 in the TP are not the reverse 
reflections of each other. 



7892 Nucleic Acids Research, 2011, Vol. 39, No. 18 



180 
160 
140 
120 
100 
80 
60 
40 

* 20 
0) 

c 

% o 
> 

1 60- 

50- 



Normal controls 



n In In , n ■ 



n -n n 



8 



40 



30 



20 



10 



Patients 



Exp data 
i i Simulated data 



Li 



1 2 3 4 5 6 7 8 

clusters 

Figure 6. Contents of the eight biggest clusters in the NC and TP 
groups (Figure 5) (black bars) compared with the same for the 
simulated data (data obtained by substitution of the real gene expres- 
sions with random values having the same SD and means for each gene 
over all samples in groups). 



interdependencies between genes can be visualized and 
differences between subgroups can be assessed. 
Correlation clustering is not just a procedure for gene 
partitioning into different compartments but is rather a 
combination of clustering and networking. This 
method provides a tool for quantitatively estimating 
interconnections between genes within clusters. 



Gene networking based on partial correlation coefficients 

Gene regulatory networks have become a major focus of 
interest in recent years. A number of reverse engineering 
approaches have been developed to help uncover these 
regulatory networks. Correlative mosaics demonstrate 
the existence of closely correlated modules, which are con- 
nected through positive or negative correlations. This type 
of presentation seems to be in good agreement with the 
widely discussed modularity of gene networks. In spite of 



this agreement some caution is necessary as the relatively 
high connectivities of gene clusters in correlation mosaic 
analysis mostly represent the indirect influences of a small 
number of regulatory elements. Information about direct 
interactions gives partial correlations that in turn enable 
to the distinguish of correlations between two variables 
that originate through direct influence versus correlation 
originated through the influence of intermediate variables. 
Partial correlation excludes many possibilities and usually 
significantly diminishes gene connectivity. We used this 
procedure for the networking of HVE-genes (18,20,21). 

Resume 4: Networking procedure based on partial 
correlations. The environmental circle for each gene is 
determined as a set of genes correlated with any given 
one having a correlation coefficient above threshold t x . 

The matrix of partial correlation coefficients within the 
environmental circle of genes is calculated. The elements 
of the matrix R^ represent the partial correlation coeffi- 
cients between the given gene and gene i with the removed 
influence of gene / All genes are within the given gene's 
environmental circle. 

The genes Q are considered to be causally intercon- 
nected with the given gene if the row Ry of the matrix 
does not have members below threshold t\, and if the 
averaged value of the row is above threshold r 2 . A 
Monte-Carlo simulation study is used to define the stat- 
istical thresholds (t x and t 2 ) below which partial correl- 
ation coefficients are likely due to chance. 

One example of the networking of HVE-genes was 
obtained during comparative analysis of the response to 
stimulation of EBV-transformed B cells derived from SLE 
patients and normal unrelated controls. Pathway Analysis 
allowed us to establish model networks of functional gene 
expression important for B cell signaling and elucidate 
gene expression regulatory interconnections disrupted in 
B cells from individuals with lupus (Dozmorov I, 
Dominguez N, Sestak AL, Xu HM, Harley JB, James 
J A, Guthridge JM manuscript in preparation). 
Fragments of this network that include genes uniquely 
activated in only one of these groups (controls or 
patients) are shown in Figure 9. These unique network 
fragments reproduced in two independent experiments 
present functional fingerprints of activated B cells from 
lupus patients and normal controls. In this context, one 
should note that practically all genes uniquely activated in 
normal controls (Figure 9A) are known as being 
'pro-apoptotic', while the genes uniquely activated in B 
cells from lupus patients (Figure 9B) are 'anti-apoptotic'. 
These results are in good agreement with the established 
defects of B cell apoptosis in lupus patients (27). 

TNF pathway modulation 

In another example this networking procedure was used to 
establish functional interconnections between HVE-genes 
in TRAPS pathology and normal control samples. 
HVE-genes demonstrating reproducible co-expression 
both in control and in TRAPS patients were selected 
(Supplementary Figure 3S). It is important to note that 
the majority of genes belonging to the largest cluster in 
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Figure 7. Mosaic of correlation coefficients of the HVE-genes in wild-type and NK1R-/- mice. The coordinates along axis are the numbers of genes 
listed in the left box. The white lines in A indicate the borders of three clusters of tightly interconnected genes. The colored lines and spots beyond 
the clusters represent positively linked genes (red) belonging to two or more clusters (Gene 5, for example), or negatively linked genes (blue). Genes 
that exhibited positive correlations over time were represented in graded shades of red, and genes negatively correlated are shown in graded shades of 
blue. Genes with an absence of correlation are indicated in green. Neprilysin is in the central position in the most prominent cluster found in 
wild-type mice, which includes a group of AP-1 responsive genes. In contrast, the association with these genes becomes negative in NK1R mice, 
who fail to mount antigen induced bladder inflammation. 



the control samples are also tightly clustered in the largest 
cluster of the patient samples. The close similarity of the 
contents of the largest clusters in two independently 
produced clustering procedures supports our hypothesis 
about common biological basis for such co-expression. 

F-means clustering of some genes associated with the 
TNF pathway are shown in Figure 10. Partial correlation 
coefficients were calculated for each pair of 42 selected 
genes. Two thresholds were used to select significant inter- 
connections. The threshold (ti) 0.7 was used to select the 
unique connections, and 0.5 was used for connections 
reproduced in the networks of both groups. The results 
of these calculations are presented in Figure 10A and B. 
The connections obtained with this method appeared to be 
consistent with current knowledge about this TNF 
pathway (Supplementary Figure S4 shows the pathway 
obtained with the use of Ingenuity Pathway Analysis). 
Interleukin-6 (IL-6) interconnections were expected 
based on the altered function of this cytokine in TRAPS 
pathology (28). The appearance of the MEFV gene in the 
TRAPS network is also interesting because mutations 
in this gene characterize another periodic fever, 
Mediterranean fever. 



DISCUSSION 

Microarray technology has revolutionized the study 
of biology by allowing the simultaneous examination of 
the expression profile of the entire genome. Gene expres- 
sion profiling enables rapid analysis of thousands of 
genes in parallel and has been used to establish 
many disease-specific fingerprints of pathology (29-31). 



Such profiling might facilitate the development of diag- 
nostic strategies for complex diseases, although one has 
to bear in mind that among hundreds of differentially ex- 
pressed genes, only a portion might play a critical role in 
pathology, while many others may have only bystander 
effects. The analysis of the disease processes requires 
methods that extend beyond comparing gene expression 
levels. The most exciting opportunity is to characterize 
pathology through changes in 'functional associations' 
among genes. Genes involved in such processes reveal 
extreme variability in their expression levels, thereby un- 
covering functional associations among them. As stated in 
the work from the Kauffman laboratory (1), random in- 
dependent inputs (as chaotic environmental perturbations 
are) allow for better recognition of regulatory associ- 
ations, and such identifications are more robustly resistant 
to noise. These properties make HVE-genes an important 
source of information about regulatory interconnections 
in biological systems. 

The most renowned problem in HVE-gene research is 
the absence of adequate statistical methods for the selec- 
tion and interpretation of HVE-genes (8). Among the 
most frequently employed statistical evaluations for 
HVE-genes are ANOVA methods, which are used to de- 
termine the fraction of genes significantly differentially 
expressed between individuals (32,33). These methods 
are simple and are based on commonly understood statis- 
tical principles. However, the problems of sensitivity and 
specificity prevent blindfolded application of these 
straightforward statistical methods to microarray 
analysis without previously determined corrections to 
the significance thresholds. 
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To address this issue, we have successfully implemented 
the Internal Standard strategy for differential gene expres- 
sion analysis (9) and developed optimal power analysis, 
including the estimation of replication requirements. 



Although we have presented several experimental conclu- 
sions within each project presented in this communication, 
some of them appear to be of general validity, and in turn 
they become solid attributes of gene expression analysis. 
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Figure 10. TNF pathway. Gene interconnection in both normal control (A) and TRAPS patients (B) obtained by calculating partial correlation 
coefficients. The solid lines represent positive interconnections with averaged partial correlation coefficients >0.7. The dashed lines represent inter- 
connections with negative partial correlation coefficients with averaged values <-0.7. The red lines represent interconnections significantly unique in 
each of the populations. 



We found that HVE-genes are true components of the 
process of gene expression regulation. Together, HVE- 
genes serve as an important source of information about 
the functional connectivity of the genome and about dy- 
namical processes based on this connectivity. 

The high incidence of expression variability, as well as 
the coherent appearance of this kind of expression, 



excludes the likelihood that this behavior occurs by 
chance (Figures 4-6). A striking feature of our findings 
is not only that a significant portion of genes are expressed 
hypervariably, but that the resulting patterns of variability 
are remarkably similar. These observations enable the ap- 
plication of standard clustering procedures to the analysis 
with the result that the contents of such clusters exceed 
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any chance coincidences. Additional evidence supporting 
the premises of our model includes the extraordinary high 
reproducibility of independently derived experimental 
sample groups (Figure 5 and Supplementary Figure S3). 

Our finding that many genes with high expression 
variabilities are associated exclusively with pathologies 
while the same set of genes display stable expression in 
normal samples (Figure 2) suggests the possibility that 
the mentioned pathologies are associated with a loss of 
control in transcriptional processes. However, this 
problem is awaiting careful investigation. Another 
surprising aspect of our findings is a functional relatedness 
among many of the studied HVE-genes. As an example, 
we point out that most genes demonstrating unique vari- 
ability in the periodic fever syndrome (TRAPS) are directly 
associated with inflammatory processes (Figure 2A and 
Supplementary Figure SI). 

In addition, almost all of the genes that are uniquely 
variable in samples from lupus patients have 
anti-apoptotic activity, whereas genes uniquely variable 
in control samples all have distinct pro-apoptotic 
activity (Figure 9). This result is in strong agreement 
with the known fact that B cells of lupus patients have 
defects in apoptosis (27). 

Application of the networking procedure to the 
HVE-genes selected from samples from TRAPS patients 
and normal controls produced remarkably reproducible 
associations among genes of the TNF pathway. The few 
differences between the 'pathological' and 'normal' 
networks are consistent with the established features of 
this pathology (6,34). 

We are committed to the viewpoint that the biological 
reality of hyper-variations in gene expression forms a solid 
basis for the analysis of biological objects. For example: 

• Statistically significant differences in the variabilities of 
HVE-genes as compared with the majority of relatively 
stable genes in an array (Figure 1A) exclude the pos- 
sibility that such fluctuations are due to chance. 

• Many HVE-genes have very similar expression 
profiles, thereby enabling the identification of large 
clusters of co-expressed genes (Figures 4 and 5). The 
sizes of such clusters significantly exceed the sizes of 
clusters in simulated random sets of data (Figure 6). 

• Some groups of co-expressed genes are highly repro- 
ducible, appearing to be only slightly altered in differ- 
ent groups of samples (Figure 5 and Supplementary 
Figure S3). 

• The clusters of co-expressed HVE-genes present 
groups of genes joined by their participation in 
regular biological processes (Figures 7-9). 

As we have shown in various applications, these 
features of HVE-genes make them a very important 
source of information regarding functional interconnec- 
tions in biological systems and processes. 

Various pathologies associated with the stimulation of 
defense functions (e.g. inflammation and autoimmunity) 
increased the proportions of the HVE-genes in compari- 
son with the relatively quiet control state (Figure 2). It is 
possible that an analogy with the temperature of physical 



bodies could be drawn with regard to the increased 
mobility of such pathologies. 

Considering that HVE-genes are a presentation of 
internal dynamic processes, it is possible to employ the 
usual methods of analyses for these processes, including 
clustering and networking approaches usually applied to 
the study of temporal dynamics. Genes could be gathered 
into groups of co-expressed genes by conventional cluster- 
ing procedures. Such clusters contain HVE-genes 
associated with common biological processes and signal- 
ing pathways. Loss or change of membership in these 
clusters by one or several genes could be a hallmark of 
pathology-associated alterations, as demonstrated in 
Figure 7. 

We usually observe more than one large cluster of 
HVE-genes with possible functional associations, which 
substantiates the coexistence of different internal 
dynamics. For example, we often observe the presence 
of two large clusters with anti-correlated profiles 
(Figure 5, see also Figure 2C). Such anti-correlation indi- 
cates that these two dynamic processes exist not as inde- 
pendent phenomena but as compensatory reactions to 
mutual changes. Deviation from the stability of genes 
within one group is accompanied by a corresponding 
and opposite change by the genes in another cluster. 
Alterations in such compensatory reactions could also 
be important hallmarks of pathology. 

The sum of two anti-correlated profiles is constant, and 
this invariability is maintained in the coordinated vari- 
ations of the profiles, i.e. the changes in one profile are 
compensated by opposite changes in another. In this situ- 
ation, it is possible that a more complicated form of com- 
pensatory reactions, incorporating the involvement of 
more than two clusters or HVE-genes with different 
dynamic profiles, is occurring. Examples of such associ- 
ations were obtained through linear discriminatory 
analysis for the classification of sample groups. Dynamic 
discriminant function analysis was developed based on the 
concept that stable classification parameters (roots) can be 
derived from highly variable gene-expression data (35). 
We demonstrated earlier that the functional interconnec- 
tions between HVE discriminatory genes can be presented 
in the form of functional networks that exhibit distinct- 
ive changes in pathology cases when compared to 
controls (35). 

In conclusion, the analysis of the coordinated behavior 
of HVE-genes can resolve the very important clinical 
problem of non-homogeneity in sample groups that 
consist of patients with phenotypically similar syndromes. 
Such discrimination and exclusion of homogeneity is es- 
pecially important in characterizing the phases of path- 
ology development and the changes in the course of 
response to the treatment and in discriminating hidden 
pathologies when a disease with common clinical charac- 
teristics can include pathologies of different molecular 
mechanisms. 
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